Abstract — Spam-based attacks are
growing in various social networks. Social network spam is a type of unwanted content that appears on social networking sites, such as Facebook, Twitter, Instagram, and others. This study used two categories of ensemble algorithms to build Twitter spam classification models. These algorithms work by combining the strengths of individual learning algorithms and then reporting their combined performance. In ensemble learning, models are formed from data based on the assumption that combining the output of multiple models is better than using a single classifier. Hence, this study used a labeled public dataset for machine learning-based Twitter spam detection. Several studies have investigated the classification of Twitter spam from the available datasets. However, there is a paucity of works that investigated how machine learning-based models built with homogeneous and heterogeneous algorithms behave in Twitter spam classification. The ANOVA-F test was used for selecting the most promising features in the dataset. Then, a homogeneous tree-based Random Forest (RF) ensemble and a heterogeneous ensemble vote classifier were employed for the classification of Twitter spam. Tree-based algorithms were used to build a homogeneous Twitter spam detection model, while a combination of Support Vector Machine (SVM) and Decision Tree (DT) algorithms was used for building the heterogeneous model (using a maximum voting classifier).
The current study found that the performance of the Twitter spam detection models was promising. In all, the heterogeneous model recorded better performance with regard to accuracy, precision, recall, and F1-score than the model built with a homogeneous base classifier.
Index Terms- ensemble classification, predictive accuracy, social network, Twitter spam detection


I. Introduction 


Twitter is a very popular social networking platform in the internet space. It has suffered from several social spam attacks in recent years. Spam-based attacks on social sites have been reported in the literature in many ways [1]. Several studies reported that these attacks are carried out through bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, fake friends, and personally identifiable information [1]. For the detection of some intrusions in networks, Machine Learning (ML) techniques were found to be very powerful compared to signature-based approaches [2]. Social spam is a type of spam content that appears on social media sites, such as Facebook, Twitter, and others, and may include any website with user-generated content [3]. Generally, ML algorithms learn from a large set of existing data and make predictions about new data based on their learning [4].


Several studies investigated the classification of Twitter spam from the available datasets. However, there is a paucity of works that investigated how machine learning-based models, built with homogeneous and heterogeneous algorithms, behave in Twitter spam classification. Homogeneous ensembles are ensembles of the same classifiers, while heterogeneous ensembles are built from different base learners [5-7].


This study used the homogeneous Random Forest (RF) ensemble as well as a heterogeneous ensemble of two algorithms (Decision Trees and Support Vector Machine) based on maximum voting for the classification of Twitter spam. The ANOVA-F test was used for feature selection, and the ensemble approaches were employed for the automatic classification of the evidence. This work seeks to extend the current authors’ previous study in the area of Twitter spam classification. Generally, ensemble algorithms are of different types. This work focuses on investigating how well two different categories of ensembles can classify Twitter spam. In a heterogeneous ensemble-based model, two single learners are used for building the ensemble. The algorithms used in the current heterogeneous ensemble were Support Vector Machine (SVM) and Decision Tree (DT) algorithms.


Ensemble algorithms work by combining the strengths of individual learning algorithms and then reporting their combined performance. The current study used a dataset that is publicly available. This work seeks to extend a study by the authors in [8], which advocated the use of two separate homogeneous ensembles for Twitter spam classification. In the current study, the first ensemble was built from DTs as base learners, while the second one was built from SVM and DTs using the maximum voting approach. A Voting Classifier (VC) is an ensemble technique that combines the predictions of various models, which together predict an output class based on the majority of votes or the highest probability. DTs and SVM are both supervised learning algorithms used widely in various classification tasks.
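
The maximum (hard) voting idea described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function name `majority_vote` and the prediction lists are hypothetical, and a third hypothetical model is included only so that ties cannot occur in the example.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by hard voting."""
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        # Count how many base models voted for each label on sample i.
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        # most_common(1) returns the label with the highest vote count;
        # equal counts are resolved in favour of the label counted first.
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on four tweets.
svm_preds = ["spammer", "non-spammer", "spammer", "non-spammer"]
dt_preds = ["spammer", "spammer", "non-spammer", "non-spammer"]
third_preds = ["non-spammer", "spammer", "spammer", "non-spammer"]

final = majority_vote([svm_preds, dt_preds, third_preds])
print(final)
```

With only two base learners, as in the heterogeneous model of this study, a disagreement produces a tie, so a tie-breaking rule (or probability-based voting) has to be chosen.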


II. Related Work 


The authors in [9] proposed the use of a directed social graph model for the detection of Twitter spam. The methodology involved exploring the “follower” and “friend” relationships among users using a graph technique. Then, based on Twitter’s spam policy, novel content-based features and graph-based features were also proposed. The authors built a prototype to analyse the data set and evaluate the performance of the detection system. Classic evaluation metrics were used to compare the performance of various traditional classification methods. Experimental results showed that the Bayesian classifier had the best overall performance in terms of F-measure. The results also showed that the spam detection system can achieve 89% precision.


Furthermore, another study [10] used four Machine Learning (ML) techniques, namely Support Vector Machine (SVM), Neural Network (NN), Random Forest (RF), and Gradient Boosting (GB), to build four different Twitter spam detection models. The system works by using a framework which takes the user-based and tweet-based features together with the tweet content to classify the tweets. The study reported that the NN had a precision of 91.65% and outperformed the existing solution by about 18%. Another system that focused on detecting spam more speedily through the creation of a large-scale annotated dataset for spam account detection on Twitter was proposed by [11]. The authors argued that the system is more effective as compared to the existing approaches.


The researchers in [12] built a large dataset of over 600 million public tweets. Then, they labelled up to 6.5 million spam tweets and extracted 12 lightweight features. Moreover, they applied a ground truth mechanism through the use of Trend Micro's Web Reputation Service as proposed by [13]. Experiments were conducted using six ML algorithms under various conditions. It was argued that the approach is effective for Twitter spam detection.


A similar study [14] carried out a review of spam attacks on social media platforms. It focused on reporting the issues related to social spam detection, as well as the directions that future research can take. The study reported that social media spam can be manifested in many ways, including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, fake friends, and personally identifiable information [14].


III. Methodology
A. Twitter Spam Dataset Source 


This study used a Twitter spam dataset developed by [11]. The dataset is publicly available at http://nsclab.org/nsclab/resources/. The files in the larger dataset were originally available in ARFF format. The first step in the methodology involved changing the files into CSV format. The feature set in the dataset is shown in Table I. Each line represents a tweet from the collection. The dataset was grouped into four, and twelve lightweight statistical features were generated, as shown in Table I. The last column in the dataset is the tweet class (spammer or non-spammer). Exploratory data analysis revealed that the dataset is binary in nature and it allows a machine learning-based model to classify tweets as spam or non-spam. As argued further by [11], two datasets were sampled for a continuous period of time, while the other two were randomly sampled. Despite the fact that the datasets contained a small feature space, selecting the most promising features for building the models is a good step. Feature subset selection is a process in which the features in the data that contribute most to the target variable are automatically selected. Thus, feature selection involves the process of selecting a subset of relevant features for use in machine learning-based model building.
Table I
Dataset Features and their Description
S/N  Attribute Name      Description of Attributes
1    account_age         The age (days) of an account since its creation until the time of sending the most recent tweet
2    no_follower         The number of followers of this Twitter user
3    no_following        The number of followings/friends of this Twitter user
4    no_userfavourites   The number of favourites this Twitter user received
5    no_lists            The number of lists this Twitter user added
6    no_tweets           The number of tweets this Twitter user sent
7    no_retweets         The number of retweets
8    no_hashtag          The number of hashtags included in this tweet
9    no_usermention      The number of user mentions included in this tweet
10   no_urls             The number of URLs included in this tweet
11   no_char             The number of characters in this tweet
12   no_digits           The number of digits in this tweet
Table II
Sample Size and Featureset in the Datasets
S/N  Derived Name for the Dataset   Instances in the Dataset  Input Features in the Dataset  [illegible]
1    TweetContinous1 (Dataset1)     10,000                    12                             YES
2    TweetRandom1 (Dataset2)        10,000                    12                             YES
3    TweetContinous2 (Dataset3)     100,000                   12                             YES
4    TweetRandom2 (Dataset4)        100,000                   12                             YES


The number of features and the instances in each one of the datasets are also depicted in Table II. The values were obtained from the exploratory data analysis carried out. Twitter spam datasets were grouped into four based on the number and types of tweet patterns contained therein. The datasets were labeled as Dataset 1, Dataset 2, Dataset 3, and Dataset 4. The dataset files were converted from the ARFF format to the CSV format. Firstly, exploratory data analysis was carried out. The essence was to understand the dataset patterns in a better way and be able to gain further insights regarding how to use the available samples and features. The characteristics of the four groups of the tweet datasets are summarised in Table II.
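
The ARFF-to-CSV conversion step can be sketched in plain Python as follows. This is only an illustrative sketch under simplifying assumptions, not the conversion script used in this study: the function name `arff_to_csv` and the sample rows are hypothetical, and only dense ARFF files with unquoted attribute names and values are handled.

```python
import csv
import io


def arff_to_csv(arff_lines, csv_file):
    """Write the header and data rows of a dense ARFF file to csv_file."""
    writer = csv.writer(csv_file)
    header = []
    in_data = False
    for raw in arff_lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            writer.writerow([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            header.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            writer.writerow(header)  # emit column names once, before the data
            in_data = True


# Hypothetical two-row excerpt in ARFF format.
sample = """@relation tweets
@attribute account_age numeric
@attribute no_follower numeric
@attribute class {spammer,non-spammer}
@data
242,0,spammer
1444,1327,non-spammer
"""

out = io.StringIO()
arff_to_csv(sample.splitlines(), out)
csv_text = out.getvalue()
print(csv_text)
```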


Exploratory data analysis also revealed that each dataset contains numeric values as input features, with the target output as categorical. No missing values were found in the datasets and all existing values were used for making decisions about how to build spam detection models. Minimal pre-processing was carried out, namely the encoding of the target class. This is because the target class is text in categorical format.
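
Encoding the categorical target class can be as simple as mapping the two labels to integers. The sketch below assumes the label strings `spammer` and `non-spammer` described above; the function name `encode_target` and the choice of 1 for the spam class are illustrative assumptions.

```python
def encode_target(labels, positive="spammer"):
    """Map the positive (spam) label to 1 and every other label to 0."""
    return [1 if label == positive else 0 for label in labels]


# Hypothetical class column from one of the CSV files.
classes = ["spammer", "non-spammer", "non-spammer", "spammer"]
encoded = encode_target(classes)
print(encoded)
```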


B. Visualisation of the Patterns in the Dataset

Fig. 1. Distributions chart in dataset 1
Fig. 2. Distributions chart in dataset 2
Fig. 3. Distributions chart in dataset 3
Fig. 4. Distributions chart in dataset 4


Figures 1-4 depict the distribution of data in each dataset. It is evident that data patterns differ from the first dataset to the fourth one. Furthermore, the sample of data types in each dataset is shown below in Figure 5.

[Figure 5: sample of the data frame for Tw1, showing numeric input columns followed by the class label (spammer or non-spammer)]


It is evident from Figure 5 that the input attributes are in a numeric form, while the target class is categorical. Thus, the target class has to be encoded as a pre-processing step prior to building the Twitter spam model from each dataset.


C. Feature Selection Technique Used


The feature selection technique used in this study is the ANOVA-F test. This technique was chosen because it is suitable for numerical input variables and a categorical target variable. The approach identified nine features as most relevant for Twitter spam classification. These features were selected based on their ranking. The authors in [15] emphasised the essence of feature selection and feature extraction in machine learning-based studies. Despite the fact that the feature set in the selected dataset is not too large, this study considered it important to select the most promising features for building the Twitter spam detection models, so as to guard against the problems of using all features in machine learning-based model building.
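
To make the ranking step concrete, the following sketch computes the one-way ANOVA F statistic of each numeric feature against the class label and keeps the top-ranked features. It is a minimal illustration, not the code used in this study: the function names, the toy values, and the two example columns are assumptions, and the study's own selection kept nine features rather than one.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a numeric feature grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    group_means = {label: mean(group) for label, group in groups.items()}
    # Between-group and within-group sums of squares.
    ss_between = sum(
        len(group) * (group_means[label] - grand_mean) ** 2
        for label, group in groups.items()
    )
    ss_within = sum(
        (value - group_means[label]) ** 2
        for label, group in groups.items()
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf")  # no within-class spread at all
    return (ss_between / df_between) / (ss_within / df_within)


def select_top_k(columns, labels, k):
    """Return the names of the k features with the largest F statistic."""
    scores = {name: anova_f(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: six tweets, encoded class 1 = spammer, 0 = non-spammer.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "no_urls": [2, 3, 2, 0, 1, 0],
    "no_char": [30, 35, 40, 32, 38, 36],
}
print(select_top_k(columns, labels, k=1))
```

In this toy example `no_urls` separates the two classes far better than `no_char`, so it receives the larger F value and is ranked first.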
D. Homogeneous and Heterogeneous Twitter Spam Detection Models


As argued by Benjamin et al. [3], machine learning-based social site spam detection can be binary class based or multiclass based. The Twitter spam detection models developed in this study are binary, since the datasets are binary (spammer, non-spammer) in nature. The first model (homogeneous model) was built from Decision Trees (DTs), the default base learners of the Random Forest (RF) algorithm, while the second model (heterogeneous model) was built using a combination of Support Vector Machine (SVM) and Logistic Regression (LR). The result of the second model is obtained by majority voting. All the base algorithms used are supervised learning algorithms. RF is an ensemble of DTs that makes use of the bagging technique [16-18]. The algorithm creates DTs on data samples for prediction and selects the best result through voting. Given a set of tweets, the goal was to identify Twitter spam based on the patterns captured by the ML algorithms from the datasets.
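
The bagging-and-voting mechanism behind RF can be sketched in a few lines of Python. The sketch below is a deliberate simplification and not the model built in this study: one-level decision stumps stand in for full decision trees, and all function names, parameter values, and toy rows are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for below, above in ((0, 1), (1, 0)):
            errors = sum(
                (below if row[feature] <= threshold else above) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, below, above)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, below, above = stump
    return below if row[feature] <= threshold else above


def fit_bagged_stumps(rows, labels, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        ensemble.append(fit_stump(sample_rows, sample_labels, feature))
    return ensemble


def predict_ensemble(ensemble, row):
    """Return the class that receives the most votes from the stumps."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]


# Toy rows: [no_urls, no_hashtag]; label 1 = spammer, 0 = non-spammer.
rows = [[3, 5], [4, 6], [5, 7], [0, 0], [1, 1], [0, 2]]
labels = [1, 1, 1, 0, 0, 0]
model = fit_bagged_stumps(rows, labels)
print(predict_ensemble(model, [10, 10]), predict_ensemble(model, [0, 0]))
```

Each stump sees a different resampled view of the data, so individual stumps may err, while the majority vote over the ensemble is more stable.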
Fig. 6. Methodological flow of activities in the two Twitter spam detection models


Figure 6 illustrates the different stages in the two machine learning-based Twitter spam detection models. Python was used for the implementation of the various stages in the models. The basic stages in machine learning-based model building, as argued by [19], were followed in model implementation. Since the problem at hand is of a binary type, the target was to accurately classify the Twitter spam evidence. The study used learning algorithms for automatically classifying the labeled datasets into spam and non-spam categories. The hyperparameters of the model were tuned repeatedly until a better result was achieved.


E. Evaluation Metrics 


The metrics used for evaluating the models in this study are accuracy, precision, recall, and F1-score. A brief explanation of each of the metrics is as follows:


i. Accuracy: The ratio of the number of correctly classified cases to the total number of cases under evaluation.


ii. Precision: The ability of a classification model to return only relevant instances.


iii. Recall: The ability of the classifier to capture all the relevant instances.


iv. F1-score: The harmonic mean of the recall and precision of the respective class.


The values of the metrics can be obtained by using equations (1) to (4), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.


$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{F1-score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{4}$$


IV. Results


The results of the RF-based Twitter spam detection model were recorded to four decimal places, as shown in Table III. Similarly, the results of the heterogeneous ensemble based on the voting method are shown in Table IV. The study used a repeatable train-test split approach in all scenarios for evaluating the Twitter spam classification models.
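
The four metrics in equations (1) to (4) can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the function name `classification_metrics` and the toy label lists are assumptions, and a ratio with an empty denominator is reported as 0.0 by convention.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score as in equations (1) to (4)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: TP = 3, FN = 1, TN = 3, FP = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```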
Table III
Classification Results of the Homogeneous RF-based Model
S/N  Learning Algorithm                    Metric           Model Performance
CASE 1 (5k continuous-Dataset 1)
1    Homogeneous Random Forest Algorithm   Accuracy         0.9736
2    Homogeneous Random Forest Algorithm   Precision Score  [illegible]
3    Homogeneous Random Forest Algorithm   Recall           0.9731
4    Homogeneous Random Forest Algorithm   F1-score         0.9675
CASE 2 (95k continuous-Dataset 2)
5    Homogeneous Random Forest Algorithm   Accuracy         0.9737
6    Homogeneous Random Forest Algorithm   Precision Score  0.2620
7    Homogeneous Random Forest Algorithm   Recall           0.9748
8    Homogeneous Random Forest Algorithm   F1-score         0.9719
CASE 3 (5k random-Dataset 3)
9    Homogeneous Random Forest Algorithm   Accuracy         0.9504
10   Homogeneous Random Forest Algorithm   Precision Score  [illegible]
11   Homogeneous Random Forest Algorithm   Recall           0.9501
12   Homogeneous Random Forest Algorithm   F1-score         0.9439
CASE 4 (95k random-Dataset 4)
13   Homogeneous Random Forest Algorithm   Accuracy         0.9732
14   Homogeneous Random Forest Algorithm   Precision Score  [illegible]
15   Homogeneous Random Forest Algorithm   Recall           0.9746
16   Homogeneous Random Forest Algorithm   F1-score         0.9695

School of System and Technology

Table IV
Classification Results of the Heterogeneous Model Based on Maximum Voting
S/N  Learning Algorithm                 Metric           Model Performance
CASE 1 (5k continuous-Dataset 1)
1    Heterogeneous Ensemble Algorithm   Accuracy         0.9922
2    Heterogeneous Ensemble Algorithm   Precision Score  0.9918
3    Heterogeneous Ensemble Algorithm   Recall           0.9922
4    Heterogeneous Ensemble Algorithm   F1-score         0.9917
CASE 2 (95k continuous-Dataset 2)
5    Heterogeneous Ensemble Algorithm   Accuracy         0.9831
6    Heterogeneous Ensemble Algorithm   Precision Score  0.9833
7    Heterogeneous Ensemble Algorithm   Recall           0.9831
8    Heterogeneous Ensemble Algorithm   F1-score         0.9817
CASE 3 (5k random-Dataset 3)
9    Heterogeneous Ensemble Algorithm   Accuracy         0.9970
10   Heterogeneous Ensemble Algorithm   Precision Score  0.9971
11   Heterogeneous Ensemble Algorithm   Recall           0.9970
12   Heterogeneous Ensemble Algorithm   F1-score         0.9970
CASE 4 (95k random-Dataset 4)
13   Heterogeneous Ensemble Algorithm   Accuracy         0.9987
14   Heterogeneous Ensemble Algorithm   Precision Score  0.9986
15   Heterogeneous Ensemble Algorithm   Recall           0.9987
16   Heterogeneous Ensemble Algorithm   F1-score         0.9986
V. Discussion


The study carried out exploratory data analysis, which provided useful information regarding how to use the datasets to build Twitter spam detection models. Having gained better insights into the four different datasets released by [11] through experimental analysis, two ensemble learning algorithms were used for building Twitter spam detection models. Then, the ANOVA-F technique was used for selecting the most promising features. The selected attributes were used to build the two models. The algorithms used to build these models were based on homogeneous and heterogeneous approaches. The results obtained from the two Twitter spam classification models are described in Table III and Table IV, respectively. During experimentation, the current study used varying training and test split ratios to validate the models. Good results were recorded at the split ratio of 75:25 for training and testing sets, respectively. The performance of the models was checked based on the four selected metrics of accuracy, precision, recall, and F1-score. It was observed that the feature selection and ensemble classification methods used contributed largely to the good performance of the two models. The results obtained for different cases remain promising for both homogeneous and heterogeneous ensembles. However, it was observed that the model built with heterogeneous machine learning algorithms (Support Vector Machine and Decision Trees) outperformed the ones built with homogeneous algorithms (tree-based).
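
A repeatable 75:25 train-test split of the kind referred to in this paper can be obtained by shuffling the sample indices with a fixed random seed. The sketch below is an assumption-laden illustration rather than the study's code: the function name, the seed value, and the sample count are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.25, seed=42):
    """Shuffle indices with a fixed seed and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))
```

Because the seed is fixed, repeated runs produce the same partition, which keeps the reported metrics comparable across experiments.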


A. Conclusion 


A general introduction to the Twitter spam classification problem, as well as the promise of using machine learning-based techniques for the identification of Twitter spam attacks, was given. Data pre-processing and feature selection approaches were used to feed the two ensemble algorithms with data in a suitable form. This study focused on building two different ensemble-based models in four different cases using four groups of Twitter spam datasets. The datasets used in this study were binary in nature. Several experiments were carried out which involved using the same type of base classifiers (Decision Trees) to build an RF-based model. Furthermore, SVM-based and DT-based classifiers were used to build a heterogeneous model. During the experiments, varying random split ratios were used to achieve model validation. Good results were recorded at the split ratio of 75:25 for training and testing sets, respectively. The performance of the models was judged based on the four selected metrics of accuracy, precision, recall, and F1-score. It was observed that the model built with heterogeneous machine learning-based algorithms outperformed the ones built with homogeneous (tree-based) algorithms.